{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "39bbe88b",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/evaluation/mt_bench_single_grading.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a44afb3b-4c42-4985-909b-f6508965fdb5",
   "metadata": {},
   "source": [
    "# Benchmarking LLM Evaluators On A Mini MT-Bench (Single Grading) `LabelledEvaluatorDataset`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53a0ea03-b7b5-47ed-8227-de416791eb6e",
   "metadata": {},
   "source": [
    "In this notebook, we'll conduct an evaluation of three different evaluators that will be judging another LLM's response for response against a user query. More specifically, we will run benchmarks using a mini version of the MT-Bench single-grading dataset. In this version, we only consider the answers on the 160 questions (i.e., 80 x 2, since there are 80 two-turn dialogues) provided by llama2-70b. The reference answers used for this benchmark are provided by GPT-4. And so, our benchmarks on these three evaluators will assess closeness to GPT-4 (actually, self-consistency for the case of GPT-4).\n",
    "\n",
    "1. GPT-3.5 (OpenAI)\n",
    "2. GPT-4 (OpenAI)\n",
    "3. Gemini-Pro (Google)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed15bedf",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai\n",
    "%pip install llama-index-llms-cohere\n",
    "%pip install llama-index-llms-gemini"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ccb6e02-8a81-4f3c-8cc7-8d193d3689e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d739ec7-174a-4282-9d24-f14d9845cf78",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"google-generativeai\" -q"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d86a307b-67c4-4455-9a54-665407a91258",
   "metadata": {},
   "source": [
    "### Load in Evaluator Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e0e9014-cfb4-4b03-bfe1-04d75c4f55e9",
   "metadata": {},
   "source": [
    "Let's load in the llama-dataset from llama-hub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2170e0d7-fc3f-45b0-bed0-7c8b8b31ac66",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.llama_dataset import download_llama_dataset\n",
    "\n",
    "# download dataset\n",
    "evaluator_dataset, _ = download_llama_dataset(\n",
    "    \"MiniMtBenchSingleGradingDataset\", \"./mini_mt_bench_data\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2277cdb-c855-46c7-9c7b-eedd09033e51",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>query</th>\n",
       "      <th>answer</th>\n",
       "      <th>contexts</th>\n",
       "      <th>ground_truth_answer</th>\n",
       "      <th>query_by</th>\n",
       "      <th>answer_by</th>\n",
       "      <th>ground_truth_answer_by</th>\n",
       "      <th>reference_feedback</th>\n",
       "      <th>reference_score</th>\n",
       "      <th>reference_evaluation_by</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compose an engaging travel blog post about a r...</td>\n",
       "      <td>Title: Discovering the Aloha Spirit: A Recent ...</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>ai (llama2-70b)</td>\n",
       "      <td>None</td>\n",
       "      <td>The generated answer is highly relevant to the...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>ai (gpt-4)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rewrite your previous response. Start every se...</td>\n",
       "      <td>Absolutely, here's the revised response:\\n\\nAl...</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>ai (llama2-70b)</td>\n",
       "      <td>None</td>\n",
       "      <td>The generated answer is highly relevant and co...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>ai (gpt-4)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Draft a professional email seeking your superv...</td>\n",
       "      <td>Subject: Seeking Feedback on Quarterly Financi...</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>ai (llama2-70b)</td>\n",
       "      <td>None</td>\n",
       "      <td>The generated answer is highly relevant to the...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>ai (gpt-4)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Take a moment to evaluate and critique your ow...</td>\n",
       "      <td>My response was:\\n\\n\"Subject: Seeking Feedback...</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>ai (llama2-70b)</td>\n",
       "      <td>None</td>\n",
       "      <td>The generated answer is highly relevant to the...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>ai (gpt-4)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Imagine you are writing a blog post comparing ...</td>\n",
       "      <td>Sure, here's an outline for a blog post compar...</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>ai (llama2-70b)</td>\n",
       "      <td>None</td>\n",
       "      <td>The generated answer is highly relevant to the...</td>\n",
       "      <td>5.0</td>\n",
       "      <td>ai (gpt-4)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               query  \\\n",
       "0  Compose an engaging travel blog post about a r...   \n",
       "1  Rewrite your previous response. Start every se...   \n",
       "2  Draft a professional email seeking your superv...   \n",
       "3  Take a moment to evaluate and critique your ow...   \n",
       "4  Imagine you are writing a blog post comparing ...   \n",
       "\n",
       "                                              answer contexts  \\\n",
       "0  Title: Discovering the Aloha Spirit: A Recent ...     None   \n",
       "1  Absolutely, here's the revised response:\\n\\nAl...     None   \n",
       "2  Subject: Seeking Feedback on Quarterly Financi...     None   \n",
       "3  My response was:\\n\\n\"Subject: Seeking Feedback...     None   \n",
       "4  Sure, here's an outline for a blog post compar...     None   \n",
       "\n",
       "  ground_truth_answer query_by        answer_by ground_truth_answer_by  \\\n",
       "0                None     None  ai (llama2-70b)                   None   \n",
       "1                None     None  ai (llama2-70b)                   None   \n",
       "2                None     None  ai (llama2-70b)                   None   \n",
       "3                None     None  ai (llama2-70b)                   None   \n",
       "4                None     None  ai (llama2-70b)                   None   \n",
       "\n",
       "                                  reference_feedback  reference_score  \\\n",
       "0  The generated answer is highly relevant to the...              5.0   \n",
       "1  The generated answer is highly relevant and co...              5.0   \n",
       "2  The generated answer is highly relevant to the...              5.0   \n",
       "3  The generated answer is highly relevant to the...              5.0   \n",
       "4  The generated answer is highly relevant to the...              5.0   \n",
       "\n",
       "  reference_evaluation_by  \n",
       "0              ai (gpt-4)  \n",
       "1              ai (gpt-4)  \n",
       "2              ai (gpt-4)  \n",
       "3              ai (gpt-4)  \n",
       "4              ai (gpt-4)  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator_dataset.to_pandas()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e40453c-1d51-4421-947e-4c3b10fee786",
   "metadata": {},
   "source": [
    "### Define Our Evaluators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "752dffac-5d23-424a-9fe3-b9e5c639602e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation import CorrectnessEvaluator\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.llms.gemini import Gemini\n",
    "from llama_index.llms.cohere import Cohere\n",
    "\n",
    "llm_gpt4 = OpenAI(temperature=0, model=\"gpt-4\")\n",
    "llm_gpt35 = OpenAI(temperature=0, model=\"gpt-3.5-turbo\")\n",
    "llm_gemini = Gemini(model=\"models/gemini-pro\", temperature=0)\n",
    "\n",
    "\n",
    "evaluators = {\n",
    "    \"gpt-4\": CorrectnessEvaluator(llm=llm_gpt4),\n",
    "    \"gpt-3.5\": CorrectnessEvaluator(llm=llm_gpt35),\n",
    "    \"gemini-pro\": CorrectnessEvaluator(llm=llm_gemini),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01d9c8be-8a1e-44fc-b590-3f02b62d5fd2",
   "metadata": {},
   "source": [
    "### Benchmark With `EvaluatorBenchmarkerPack` (llama-pack)\n",
    "\n",
    "When using the `EvaluatorBenchmarkerPack` with a `LabelledEvaluatorDataset`, the returned benchmarks will contain values for the following quantites:\n",
    "\n",
    "- `number_examples`: The number of examples the dataset consists of.\n",
    "- `invalid_predictions`: The number of evaluations that could not yield a final evaluation (e.g., due to inability to parse the evaluation output, or an exception thrown by the LLM evaluator)\n",
    "- `correlation`: The correlation between the scores of the provided evaluator and those of the reference evaluator (in this case gpt-4).\n",
    "- `mae`: The mean absolute error between the scores of the provided evaluator and those of the reference evaluator.\n",
    "- `hamming`: The hamming distance between the scores of the provided evaluator and those of the reference evaluator.\n",
    "\n",
    "NOTE: `correlation`, `mae`, and `hamming` are all computed without invalid predictions. So, essentially these metrics are conditional ones, conditioned on the prediction being valid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e279d1f8-af0f-4557-b836-7a2d3bb6ef59",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.llama_pack import download_llama_pack\n",
    "\n",
    "EvaluatorBenchmarkerPack = download_llama_pack(\n",
    "    \"EvaluatorBenchmarkerPack\", \"./pack\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0d9a4b7-781b-44ef-ab11-e2328b2a00e8",
   "metadata": {},
   "source": [
    "#### GPT 3.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "950c29b1-89ca-4ded-91a0-8256da4e8b84",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator_benchmarker = EvaluatorBenchmarkerPack(\n",
    "    evaluator=evaluators[\"gpt-3.5\"],\n",
    "    eval_dataset=evaluator_dataset,\n",
    "    show_progress=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d142745-6881-45e6-ae51-066f7b2ae1a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/nerdai/Projects/llama_index/docs/examples/evaluation/pack/base.py:142: UserWarning: You've set a large batch_size (>10). If using OpenAI GPT-4 as  `judge_llm` (which is the default judge_llm), you may experience a RateLimitError. Previous successful eval  responses are cached per batch. So hitting a RateLimitError would mean you'd lose all of the current batches successful  GPT-4 calls.\n",
      "  warnings.warn(\n",
      "Batch processing of predictions: 100%|████████████████████| 100/100 [00:05<00:00, 18.88it/s]\n",
      "Batch processing of predictions: 100%|██████████████████████| 60/60 [00:04<00:00, 12.26it/s]\n"
     ]
    }
   ],
   "source": [
    "gpt_3p5_benchmark_df = await evaluator_benchmarker.arun(\n",
    "    batch_size=100, sleep_time_in_seconds=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8300c5ce-748f-4ca4-9219-72871806cc5d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_examples</th>\n",
       "      <th>invalid_predictions</th>\n",
       "      <th>correlation</th>\n",
       "      <th>mae</th>\n",
       "      <th>hamming</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gpt-3.5</th>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "      <td>0.317047</td>\n",
       "      <td>1.11875</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         number_examples  invalid_predictions  correlation      mae  hamming\n",
       "gpt-3.5              160                    0     0.317047  1.11875       27"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt_3p5_benchmark_df.index = [\"gpt-3.5\"]\n",
    "gpt_3p5_benchmark_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90e6cf9d-4848-456f-986f-954396939ad8",
   "metadata": {},
   "source": [
    "#### GPT-4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6445b17d-2892-4915-9d5b-e1ad6142d2b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator_benchmarker = EvaluatorBenchmarkerPack(\n",
    "    evaluator=evaluators[\"gpt-4\"],\n",
    "    eval_dataset=evaluator_dataset,\n",
    "    show_progress=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4b269e3-9125-4305-acbf-2fdfb9f4222a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/nerdai/Projects/llama_index/docs/examples/evaluation/pack/base.py:142: UserWarning: You've set a large batch_size (>10). If using OpenAI GPT-4 as  `judge_llm` (which is the default judge_llm), you may experience a RateLimitError. Previous successful eval  responses are cached per batch. So hitting a RateLimitError would mean you'd lose all of the current batches successful  GPT-4 calls.\n",
      "  warnings.warn(\n",
      "Batch processing of predictions: 100%|████████████████████| 100/100 [00:13<00:00,  7.26it/s]\n",
      "Batch processing of predictions: 100%|██████████████████████| 60/60 [00:10<00:00,  5.92it/s]\n"
     ]
    }
   ],
   "source": [
    "gpt_4_benchmark_df = await evaluator_benchmarker.arun(\n",
    "    batch_size=100, sleep_time_in_seconds=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3af7f475-e7fa-4123-984c-fe722fd6bc08",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_examples</th>\n",
       "      <th>invalid_predictions</th>\n",
       "      <th>correlation</th>\n",
       "      <th>mae</th>\n",
       "      <th>hamming</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gpt-4</th>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "      <td>0.966126</td>\n",
       "      <td>0.09375</td>\n",
       "      <td>143</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       number_examples  invalid_predictions  correlation      mae  hamming\n",
       "gpt-4              160                    0     0.966126  0.09375      143"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt_4_benchmark_df.index = [\"gpt-4\"]\n",
    "gpt_4_benchmark_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09470187-876f-4919-8d40-7dcabd901036",
   "metadata": {},
   "source": [
    "#### Gemini Pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71831f98-34ff-4175-9603-571b2a72086f",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator_benchmarker = EvaluatorBenchmarkerPack(\n",
    "    evaluator=evaluators[\"gemini-pro\"],\n",
    "    eval_dataset=evaluator_dataset,\n",
    "    show_progress=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f667926-ae4e-48ae-9c42-ecbc70b33536",
   "metadata": {},
   "outputs": [],
   "source": [
    "gemini_pro_benchmark_df = await evaluator_benchmarker.arun(\n",
    "    batch_size=5, sleep_time_in_seconds=0.5\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dbf5d22-75a2-48c6-9639-e822146d79b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_examples</th>\n",
       "      <th>invalid_predictions</th>\n",
       "      <th>correlation</th>\n",
       "      <th>mae</th>\n",
       "      <th>hamming</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gemini-pro</th>\n",
       "      <td>160</td>\n",
       "      <td>1</td>\n",
       "      <td>0.295121</td>\n",
       "      <td>1.220126</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            number_examples  invalid_predictions  correlation       mae  \\\n",
       "gemini-pro              160                    1     0.295121  1.220126   \n",
       "\n",
       "            hamming  \n",
       "gemini-pro       12  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gemini_pro_benchmark_df.index = [\"gemini-pro\"]\n",
    "gemini_pro_benchmark_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf29da98-2e2c-453e-977b-5afae3a102bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator_benchmarker.prediction_dataset.save_json(\n",
    "    \"mt_sg_gemini_predictions.json\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da3675c4-65d1-417a-88b2-585f40b5671c",
   "metadata": {},
   "source": [
    "### In Summary\n",
    "\n",
    "Putting all baselines together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5231aad-93e3-409a-a84e-9c23857cc7ce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_examples</th>\n",
       "      <th>invalid_predictions</th>\n",
       "      <th>correlation</th>\n",
       "      <th>mae</th>\n",
       "      <th>hamming</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>gpt-3.5</th>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "      <td>0.317047</td>\n",
       "      <td>1.118750</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gpt-4</th>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "      <td>0.966126</td>\n",
       "      <td>0.093750</td>\n",
       "      <td>143</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gemini-pro</th>\n",
       "      <td>160</td>\n",
       "      <td>1</td>\n",
       "      <td>0.295121</td>\n",
       "      <td>1.220126</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            number_examples  invalid_predictions  correlation       mae  \\\n",
       "gpt-3.5                 160                    0     0.317047  1.118750   \n",
       "gpt-4                   160                    0     0.966126  0.093750   \n",
       "gemini-pro              160                    1     0.295121  1.220126   \n",
       "\n",
       "            hamming  \n",
       "gpt-3.5          27  \n",
       "gpt-4           143  \n",
       "gemini-pro       12  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "final_benchmark = pd.concat(\n",
    "    [\n",
    "        gpt_3p5_benchmark_df,\n",
    "        gpt_4_benchmark_df,\n",
    "        gemini_pro_benchmark_df,\n",
    "    ],\n",
    "    axis=0,\n",
    ")\n",
    "final_benchmark"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80234260-8f53-4aa9-899b-85ed68bb7cda",
   "metadata": {},
   "source": [
    "From the results above, we make the following observations:\n",
    "- GPT-3.5 and Gemini-Pro seem to have similar results, with perhaps the slightes edge to GPT-3.5 in terms of closeness to GPT-4.\n",
    "- Though, both don't seem to be too close to GPT-4.\n",
    "- GPT-4 seems to be pretty consistent with itself in this benchmark."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
